 input pipeline


tf.data service: A Case for Disaggregating ML Input Data Processing

arXiv.org Artificial Intelligence

Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to either under-utilizing the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale-out to right-size CPU/RAM host resources for data processing in each job, saving 32x training time and 26x cost, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2x, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.


Posit AI Blog: Pre-processing layers in keras: What they are and how to use them

#artificialintelligence

Data pre-processing: What you do to the data before feeding it to the model. Where, exactly, should pre-processing stop, and the model begin? Are steps like normalization, or various numerical transforms, part of the model, or the pre-processing? In sum, the line between what is pre-processing and what is modeling has always, at the edges, felt somewhat fluid. In this situation, the advent of keras pre-processing layers changes a long-familiar picture.


Overcoming ML Data Preprocessing Bottlenecks With gRPC

#artificialintelligence

One of the measures of the health of a deep learning project is the degree to which it utilizes the training resources that it was allocated. Whether you are training in the cloud or on your own private infrastructure, training resources cost money, and any block of time in which they are left idle represents a potential opportunity to increase training throughput and overall productivity. This is particularly true for the training accelerator -- typically the most expensive training resource -- whether it be a GPU, a Google TPU, or a Habana Gaudi. This blog is a sequel to a previous post on the topic of Overcoming Data Preprocessing Bottlenecks in which we addressed the undesired scenario in which your training accelerator, henceforth assumed to be a GPU, finds itself idle while it waits for data input from an overly tasked CPU. The post covered several different ways of addressing this type of bottleneck and demonstrated them on a toy example, all the while emphasizing that the best option would very much depend on the specifics of the model and project at hand.


How Parallelization and Large Batch Size Improve the Performance of Deep Neural Networks.

#artificialintelligence

Large batch sizes were, until recently, viewed as a deterrent to good accuracy. However, recent studies show that increasing the batch size can significantly reduce training time while maintaining a considerable level of accuracy. In this blog, we draw inferences from four such technical papers. The RMSprop warm-up phase is used to address the optimization difficulty at the start of training. The update rule demonstrated below uses both Stochastic Gradient Descent (SGD) and the RMSprop optimization algorithm.


Deep Dive into TensorBoard: Tutorial With Examples - neptune.ai

#artificialintelligence

Start by clearing the logs; alternatively, you can use timestamped log folders. After that, specify the log directory and create a file writer with tf.summary.create_file_writer.
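As a rough illustration of the two options in this snippet (clearing old logs versus writing each run to a timestamped folder), here is a minimal standard-library sketch. The helper name `fresh_log_dir`, the `logs` root, and the timestamp format are assumptions made for this example, not taken from the tutorial.

```python
import shutil
from datetime import datetime
from pathlib import Path


def fresh_log_dir(root="logs", clear=False, now=None):
    """Return a timestamped run directory under `root` (hypothetical helper).

    clear=True removes every earlier run first; clear=False keeps old runs,
    so each run lives in its own timestamped folder.
    """
    root = Path(root)
    if clear and root.exists():
        shutil.rmtree(root)  # option 1: start from an empty log root
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    log_dir = root / stamp  # option 2: one folder per run
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir


# With TensorFlow installed, the returned path could then be handed to the
# writer named in the text:
#   writer = tf.summary.create_file_writer(str(fresh_log_dir()))
```

Keeping timestamped folders instead of clearing makes it possible to compare several runs side by side in TensorBoard.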


Classify structured data with feature columns

#artificialintelligence

This tutorial demonstrates how to classify structured data (e.g. tabular data in a CSV). We will use Keras to define the model, and feature columns as a bridge to map from columns in a CSV to features used to train the model. We will use a simplified version of the PetFinder dataset. There are several thousand rows in the CSV. Each row describes a pet, and each column describes an attribute.


Debugging a Machine Learning model written in TensorFlow and Keras

#artificialintelligence

In this article, you get to look over my shoulder as I go about debugging a TensorFlow model. I did a lot of dumb things, so please don't judge. You can see the final (working) model on GitHub. I'm building a model to predict lightning 30 minutes into the future and plan to present it at the American Meteorological Society. The basic idea is to create 64x64 image patches around each pixel of infrared and Global Lightning Mapper (GLM) GOES-16 data and label the pixel as "has_ltg=1" if lightning actually occurs 30 minutes later within a 16x16 image patch around the pixel. A model trained in this way can be used to predict lightning 30 minutes ahead in real time given the current infrared and GLM data.
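The patch-and-label scheme described in this snippet can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `window` and `has_ltg` are hypothetical, grids are nested lists, windows are clipped at the image edges, and any positive value in the future GLM frame counts as lightning. The author's actual code on GitHub may differ on all of these points.

```python
def window(grid, row, col, size):
    """Return the size x size window around (row, col), clipped at the edges.

    The window spans rows [row - size//2, row + size//2) and the same for
    columns (an assumed convention for even sizes).
    """
    half = size // 2
    r0, r1 = max(0, row - half), min(len(grid), row + half)
    c0, c1 = max(0, col - half), min(len(grid[0]), col + half)
    return [line[c0:c1] for line in grid[r0:r1]]


def has_ltg(future_glm, row, col, label_size=16):
    """Label a pixel 1 if the later GLM frame shows lightning nearby, else 0."""
    patch = window(future_glm, row, col, label_size)
    return int(any(value > 0 for line in patch for value in line))


def make_example(infrared, glm_now, glm_future, row, col, patch_size=64):
    """Build one training example: two input patches plus the binary label."""
    features = {
        "ir": window(infrared, row, col, patch_size),
        "glm": window(glm_now, row, col, patch_size),
    }
    return features, has_ltg(glm_future, row, col)
```

Here `glm_future` stands for the GLM frame observed 30 minutes after `infrared` and `glm_now`, so the label always comes from a later frame than the inputs.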


Lessons for Improving Training Performance -- Part 1

#artificialintelligence

Nine months ago, as part of a joint reference architecture launch with Nvidia, Pure Storage published TensorFlow deep learning performance results. The goal of creating a joint architecture with Nvidia was to identify and solve performance bottlenecks present in an end-to-end deep learning environment -- especially at scale. During creation of our reference architecture, my team identified and resolved performance issues across storage, networking, and compute. Our system is a physical entity, and everything from cabling configuration and MTU size to TensorFlow prefetch buffer size can impact performance.


VAEs! Generating images with Tensorflow – Towards Data Science

#artificialintelligence

In my previous post I covered the theory behind Variational Autoencoders. It's time now to get our hands dirty and develop some code that can lead us to a better comprehension of this technique. I decided to use Tensorflow since I want to improve my skills with it and adapt to the latest changes that are being pushed towards the 2.0 version. Tensorflow (with the recently incorporated Keras API) provides a reasonable number of image datasets that we can use to test the performance of our network. It is super simple to import them without losing time on data preprocessing.


Lessons for Improving Training Performance -- Part 2

#artificialintelligence

Over the past nine months, the input pipeline part of deep learning training jobs in TensorFlow has become significantly more efficient. In this post, we investigate the performance impact of those TensorFlow changes and discuss upcoming developments that will continue to impact full-stack performance. In Part 1 of this blog, we discussed the performance benefits of switching to lower precision and higher batch size during training. Both precision and batch size have "optimal" values for various workloads, and these values have evolved over time. Beyond those major, well-known parameters, the entire software stack is evolving to improve training job throughput.